This work is based on the "Individual household electric power consumption Data Set" from the UCI Machine Learning Repository. Per the description, it contains measurements of electric power consumption in one household, with a one-minute sampling rate, over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
The goal is to size a hypothetical solar power system for the house so that grid reliance is minimized in most situations. A few rough assumptions are introduced to keep this preliminary analysis simple: the sun is always shining (day and night), and all the power from the solar panels can be used by the appliances.
To size the system, it is important to model the power consumption of the entire household and understand its trends. Finally, a six-month prediction of the power consumption is provided.
Attribute info (from UCI Machine Learning repository).
1. date: date in format dd/mm/yyyy
2. time: time in format hh:mm:ss
3. global_active_power: household global minute-averaged active power (in kilowatt)
4. global_reactive_power: household global minute-averaged reactive power (in kilowatt)
5. voltage: minute-averaged voltage (in volt)
6. global_intensity: household global minute-averaged current intensity (in ampere)
7. sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
8. sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
9. sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075259 entries, 0 to 2075258
Data columns (total 8 columns):
 #   Column                 Dtype
---  ------                 -----
 0   date                   datetime64[ns]
 1   Global_active_power    float64
 2   Global_reactive_power  float64
 3   Voltage                float64
 4   Global_intensity       float64
 5   Sub_metering_1         float64
 6   Sub_metering_2         float64
 7   Sub_metering_3         float64
dtypes: datetime64[ns](1), float64(7)
memory usage: 126.7 MB
Null values are represented by '?' in the original csv file. Approximately 1.25% of the rows have missing values. These are distributed across the dates, with some clustering towards the final dates. Depending on the time aggregation (hours, days, weeks, or months), proper treatment of the missing values is required.
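The loading step can be sketched as follows; a tiny inline sample stands in for the real UCI file (which uses ';' as separator, '?' for missing values, and dd/mm/yyyy dates):

```python
import io
import pandas as pd

# Minimal inline sample standing in for the original csv file
sample = (
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;?\n"
)

df = pd.read_csv(io.StringIO(sample), sep=";", na_values="?")  # '?' -> NaN
df["date"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                            format="%d/%m/%Y %H:%M:%S")        # dd/mm/yyyy dates
df = df.drop(columns=["Date", "Time"])

print(df.isnull().sum())
```

Passing `na_values="?"` makes pandas parse the measurement columns as floats with NaN for the missing entries, instead of object columns of strings.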
Null values per column:
|  | null_values | null_value_percent |
|---|---|---|
| date | 0 | 0.000000 |
| Global_active_power | 25979 | 1.251844 |
| Global_reactive_power | 25979 | 1.251844 |
| Voltage | 25979 | 1.251844 |
| Global_intensity | 25979 | 1.251844 |
| Sub_metering_1 | 25979 | 1.251844 |
| Sub_metering_2 | 25979 | 1.251844 |
| Sub_metering_3 | 25979 | 1.251844 |
Dates in this dataset have temporal resolution of 1 minute.
Number of dates: 2075259, Unique: 2075259
min                   2006-12-16 17:24:00
max                   2010-12-11 23:59:00
data_unique_values                2075259
expected_values                   2097035
dtype: object
No duplicated dates
Original dataset is missing 21776 dates
No duplicated index dates
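The missing-timestamp check can be sketched by comparing the actual index against a full minute-resolution grid; here a three-minute synthetic index (with one minute absent) stands in for the real one:

```python
import pandas as pd

# Three minutes of data, with 00:02 absent, standing in for the real index
idx = pd.DatetimeIndex(["2010-11-26 00:00", "2010-11-26 00:01",
                        "2010-11-26 00:03"])

full = pd.date_range(idx.min(), idx.max(), freq="min")  # expected minute grid
missing = full.difference(idx)                          # timestamps not recorded
missing_per_day = missing.normalize().value_counts()    # missing minutes per day
print(missing_per_day)
```

On the real index, `missing_per_day` produces the per-date counts shown in the table below (1440 missing minutes means a whole day is absent).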
| dates | missing_minutes |
|---|---|
| 2010-01-12 | 1440 |
| 2010-02-12 | 1440 |
| 2010-03-12 | 1440 |
| 2010-04-12 | 1440 |
| 2010-05-12 | 1440 |
| 2010-06-12 | 1440 |
| 2010-07-12 | 1440 |
| 2010-08-12 | 1440 |
| 2010-09-12 | 1440 |
| 2010-10-12 | 1440 |
| 2010-11-12 | 1440 |
| 2010-11-26 | 177 |
| 2010-11-27 | 1440 |
| 2010-11-28 | 1440 |
| 2010-11-29 | 1440 |
| 2010-11-30 | 1440 |
As many values are missing after November 26, 2010, only the values before November 27, 2010 are considered.
As we preprocessed the dataset earlier, there should not be any null values left.
Null values per column:
|  | null_values | null_value_percent |
|---|---|---|
| Global_active_power | 0 | 0.0 |
| Global_reactive_power | 0 | 0.0 |
| Voltage | 0 | 0.0 |
| Global_intensity | 0 | 0.0 |
| Sub_metering_1 | 0 | 0.0 |
| Sub_metering_2 | 0 | 0.0 |
| Sub_metering_3 | 0 | 0.0 |
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Global_active_power | 2059419.0 | 1.089090 | 1.053636 | 0.076 | 0.310 | 0.61 | 1.526 | 11.122 |
| Global_reactive_power | 2059419.0 | 0.123539 | 0.112450 | 0.000 | 0.048 | 0.10 | 0.194 | 1.390 |
| Voltage | 2059419.0 | 240.827439 | 3.239969 | 223.200 | 238.980 | 241.00 | 242.870 | 254.150 |
| Global_intensity | 2059419.0 | 4.616927 | 4.428958 | 0.200 | 1.400 | 2.60 | 6.400 | 48.400 |
| Sub_metering_1 | 2059419.0 | 1.109423 | 6.121396 | 0.000 | 0.000 | 0.00 | 0.000 | 88.000 |
| Sub_metering_2 | 2059419.0 | 1.289165 | 5.793815 | 0.000 | 0.000 | 0.00 | 1.000 | 80.000 |
| Sub_metering_3 | 2059419.0 | 6.418500 | 8.411376 | 0.000 | 0.000 | 1.00 | 17.000 | 31.000 |
The main goal of this analysis is the forecast of Global_active_power, the overall power consumption of the household. The three sub-metering variables are largely independent of each other, while Global_active_power is significantly correlated with them (especially with Sub_metering_3). Global_active_power is also strongly correlated with Global_intensity: since power is the product of voltage (fairly constant) and current, this is expected. Voltage is an uninformative variable in this case.
Most correlated pairs:
Global_active_power Global_intensity 0.998893
Global_intensity Global_active_power 0.998893
Global_active_power Sub_metering_3 0.638764
Sub_metering_3 Global_active_power 0.638764
Sub_metering_3 Global_intensity 0.626822
Global_intensity Sub_metering_3 0.626822
dtype: float64
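A sketch of the most-correlated-pairs computation: unstack the correlation matrix, drop the self-correlations, and sort. A small synthetic frame (with `Global_intensity` nearly proportional to `Global_active_power`, as in the real data) stands in for the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "Global_active_power": base,
    "Global_intensity": base * 4.2 + rng.normal(0, 0.1, 500),  # nearly proportional
    "Voltage": rng.normal(240, 3, 500),                        # uncorrelated
})

pairs = df.corr().abs().unstack()
pairs = pairs[pairs < 1.0].sort_values(ascending=False)  # drop self-correlations
print(pairs.head(4))
```

Each pair appears twice because the correlation matrix is symmetric, which is also why the listing above shows duplicated values.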
The goal is to predict Global_active_power for the next months. We are going to compute aggregations on the dataset, since the temporal resolution of the dataset (minutes) is too fine for a long-term prediction. Based on the goal of the analysis and the shape of the Global_active_power distribution, a choice must be made for the aggregation function that will be used.
Goal: compute the amount of power needed to size a solar panel system that would satisfy most of the household energy needs. Assumptions: sun is always shining, and all solar power from the panels is available to use. Shape of the distribution: bimodal with long tail on the right side.
For this work, the monthly 99th percentile is chosen as the aggregation function (and hence as the quantity being predicted). The resulting distribution is Gaussian, and it captures most of the monthly maximum usage needed to size the panel system. For the remaining 1%, we can rely on the traditional electric grid.
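The monthly aggregation can be sketched with `resample`; synthetic minute-level data (right-skewed, like the raw consumption) stands in for the real series:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2007-07-01", "2007-09-30 23:59", freq="min")
power = pd.Series(rng.gamma(2.0, 0.5, len(idx)), index=idx,
                  name="Global_active_power")  # right-skewed, like the raw data

monthly_p99 = power.resample("M").quantile(0.99)  # one value per month
print(monthly_p99)
```
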
Exclude data before July 2007, as it has relatively high magnitude with respect to the remaining data.
The data is non-stationary: the mean value is not constant, there is a negative trend present, and seasonality is visible. It is not clear whether there is heteroscedasticity. The data is autocorrelated as well.
Splitting the monthly data into 10 chunks, we can see that the variance across chunks is not constant (a sign of non-stationarity).
|  | mean | var |
|---|---|---|
| 0 | 5.509652 | 0.152727 |
| 1 | 4.780756 | 0.026575 |
| 2 | 5.313856 | 0.184729 |
| 3 | 4.711944 | 0.071158 |
| 4 | 4.840024 | 0.087466 |
| 5 | 4.632652 | 0.213626 |
| 6 | 4.267048 | 0.125345 |
| 7 | 4.952680 | 0.094290 |
| 8 | 3.892915 | 0.079233 |
| 9 | 4.246335 | 0.113754 |
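The chunk-variance check can be sketched as follows, on a synthetic trending series standing in for the 48 monthly values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
series = pd.Series(np.linspace(5.5, 4.0, 48) + rng.normal(0, 0.3, 48))  # downward trend

chunks = np.array_split(series.to_numpy(), 10)  # 10 roughly equal chunks
stats = pd.DataFrame({"mean": [c.mean() for c in chunks],
                      "var": [c.var() for c in chunks]})
print(stats)
```

If the series were stationary, per-chunk means and variances would fluctuate around common values; a drifting mean, as here, is a sign of non-stationarity.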
According to the Augmented Dickey-Fuller test, the series is stationary. However, both seasonality and trend are clearly visible, which underlines the importance of visual inspection. A different choice of the lag parameter in the ADF test would likely give a different result.
Test statistic: -3.230, pvalue: 0.018
Critical values: {'1%': -3.5778480370438146, '5%': -2.925338105429433, '10%': -2.6007735310095064}
Observations: 47, Used Lag: 0
If we difference the data (take the difference between one point and the next) and analyze the resulting series, we notice that the result does not show any pattern, and the autocorrelation and partial autocorrelation values are all within the boundary of statistical non-significance. The first lag in the ACF is negative, but not below -0.5. This means that the series does not need any further differencing, and the differencing order d is 1. In general, d=1 means that the original data presents a constant average trend, which is the case for the data under examination.
Standard deviation (original time series): 0.57370
Standard deviation (diff=1): 0.40261
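The differencing check can be sketched as follows; a synthetic trending series stands in for the monthly data, and the standard deviation drops after a first difference, as above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
series = pd.Series(np.linspace(5.5, 4.0, 48) + rng.normal(0, 0.2, 48))

diff1 = series.diff().dropna()  # first difference removes the linear trend
print(f"std(original): {series.std():.5f}, std(diff=1): {diff1.std():.5f}")
```
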
The ACF represents the correlation between the current value of the series and its lagged values. There is always a cross-over between successive lags in the ACF plot. In this case, it is likely that only the first lag is significant.
There is a sharp cutoff in the PACF (while the ACF decays more slowly), so the series can be well represented by an AR model. Only the previous value (lag=1) appears to be significant for the prediction, which translates into an AR(1) model. Note that an AR(1) model with a coefficient close to 1 behaves much like a first difference (d=1, as seen above); in that sense, the original series is under-differenced.
There is a clear seasonal effect, with lower power consumption in summer months.
Use statsmodels.tsa.seasonal.seasonal_decompose to analyze the time series with an additive model with a period of 12 months. The seasonality pattern in the data is captured very well, as is the trend. The residuals should be white noise, and they can be tested for stationarity with the Augmented Dickey-Fuller test.
Residuals are normally distributed, and the Augmented Dickey-Fuller test indicates a stationary series.
Test statistic: -4.285, pvalue: 0.00047
Critical values: {'1%': -3.661428725118324, '5%': -2.960525341210433, '10%': -2.6193188033298647}
Observations: 31, Used Lag: 4
As the data presents both trend and seasonality, triple exponential smoothing is a more appropriate model than single or double exponential smoothing. In the plot, the solid blue line is the training data, the dashed line is the test data, and the green line represents the 6-month prediction on the test set.
Mean squared error: 0.06710, Mean abs percentage error: 0.05696
The choice of an autoregressive (AR) model (of order p) and/or a moving average (MA) model (of order q) can be carried out using the Box-Jenkins Methodology. From ACF and PACF plots:
For this data, an AR model with $p=1$ is estimated to be appropriate.
ARMA models include an autoregressive (AR) and a moving average (MA) part. As the data is likely well represented by an ARMA(p,0) model, I'm going to start from an AR(1) model.
| Dep. Variable: | Global_active_power | No. Observations: | 48 |
|---|---|---|---|
| Model: | ARIMA(1, 0, 0) | Log Likelihood | -22.163 |
| Date: | Sat, 22 Jan 2022 | AIC | 50.325 |
| Time: | 12:03:56 | BIC | 55.939 |
| Sample: | 12-31-2006 | HQIC | 52.447 |
| - 11-30-2010 | |||
| Covariance Type: | opg |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.8085 | 0.203 | 23.741 | 0.000 | 4.411 | 5.205 |
| ar.L1 | 0.7749 | 0.087 | 8.920 | 0.000 | 0.605 | 0.945 |
| sigma2 | 0.1446 | 0.033 | 4.450 | 0.000 | 0.081 | 0.208 |
| Ljung-Box (L1) (Q): | 0.20 | Jarque-Bera (JB): | 0.21 |
|---|---|---|---|
| Prob(Q): | 0.65 | Prob(JB): | 0.90 |
| Heteroskedasticity (H): | 1.11 | Skew: | -0.04 |
| Prob(H) (two-sided): | 0.84 | Kurtosis: | 2.68 |
Residuals from the ARMA model look normally distributed and without any auto-correlation.
Mean squared error: 0.21626, Mean abs percentage error: 0.10408
With a SARIMA model we can introduce (over the ARMA model) a seasonal component, with its own (P,D,Q,s) parameters.
For SARIMA prediction models, it is also important to analyze the auto-correlation of the seasonal component. It seems that only the first seasonal lag is important for prediction, suggesting a seasonal AR(1) term for the seasonal component. Once again, this is confirmed by differencing the seasonal time series (12-month difference).
With differencing:
The parameters for the fit:
A linear trend component is added (trend='t') to account for the downward trend (this could also be achieved using d=1).
Mean squared error: 0.06183, Mean abs percentage error: 0.05252
| Dep. Variable: | Global_active_power | No. Observations: | 42 |
|---|---|---|---|
| Model: | SARIMAX(1, 0, 0)x(1, 1, 0, 12) | Log Likelihood | -8.349 |
| Date: | Sat, 22 Jan 2022 | AIC | 24.697 |
| Time: | 12:03:59 | BIC | 30.302 |
| Sample: | 12-31-2006 | HQIC | 26.490 |
| - 05-31-2010 | |||
| Covariance Type: | opg |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| drift | -0.0095 | 0.004 | -2.117 | 0.034 | -0.018 | -0.001 |
| ar.L1 | 0.2339 | 0.231 | 1.013 | 0.311 | -0.219 | 0.686 |
| ar.S.L12 | -0.4719 | 0.299 | -1.577 | 0.115 | -1.059 | 0.115 |
| sigma2 | 0.0922 | 0.025 | 3.742 | 0.000 | 0.044 | 0.140 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 3.29 |
|---|---|---|---|
| Prob(Q): | 0.95 | Prob(JB): | 0.19 |
| Heteroskedasticity (H): | 2.07 | Skew: | 0.72 |
| Prob(H) (two-sided): | 0.27 | Kurtosis: | 3.73 |
Performing stepwise search to minimize aic
 ARIMA(1,0,1)(0,1,1)[12] intercept   : AIC=inf, Time=0.25 sec
 ARIMA(0,0,0)(0,1,0)[12] intercept   : AIC=22.539, Time=0.01 sec
 ARIMA(1,0,0)(1,1,0)[12] intercept   : AIC=20.619, Time=0.10 sec
 ARIMA(0,0,1)(0,1,1)[12] intercept   : AIC=inf, Time=0.16 sec
 ARIMA(0,0,0)(0,1,0)[12]             : AIC=35.875, Time=0.01 sec
 ARIMA(1,0,0)(0,1,0)[12] intercept   : AIC=24.103, Time=0.02 sec
 ARIMA(1,0,0)(2,1,0)[12] intercept   : AIC=19.572, Time=0.28 sec
 ARIMA(1,0,0)(2,1,1)[12] intercept   : AIC=21.572, Time=0.29 sec
 ARIMA(1,0,0)(1,1,1)[12] intercept   : AIC=inf, Time=0.27 sec
 ARIMA(0,0,0)(2,1,0)[12] intercept   : AIC=17.646, Time=0.20 sec
 ARIMA(0,0,0)(1,1,0)[12] intercept   : AIC=18.920, Time=0.05 sec
 ARIMA(0,0,0)(2,1,1)[12] intercept   : AIC=19.646, Time=0.22 sec
 ARIMA(0,0,0)(1,1,1)[12] intercept   : AIC=inf, Time=0.31 sec
 ARIMA(0,0,1)(2,1,0)[12] intercept   : AIC=19.452, Time=0.27 sec
 ARIMA(1,0,1)(2,1,0)[12] intercept   : AIC=inf, Time=0.57 sec
 ARIMA(0,0,0)(2,1,0)[12]             : AIC=39.479, Time=0.10 sec
Best model: ARIMA(0,0,0)(2,1,0)[12] intercept
Total fit time: 3.109 seconds
| Dep. Variable: | y | No. Observations: | 42 |
|---|---|---|---|
| Model: | SARIMAX(2, 1, 0, 12) | Log Likelihood | -4.823 |
| Date: | Sat, 22 Jan 2022 | AIC | 17.646 |
| Time: | 12:04:04 | BIC | 23.251 |
| Sample: | 0 | HQIC | 19.439 |
| - 42 | |||
| Covariance Type: | opg |
|  | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | -0.6674 | 0.133 | -4.999 | 0.000 | -0.929 | -0.406 |
| ar.S.L12 | -0.7443 | 0.335 | -2.219 | 0.026 | -1.402 | -0.087 |
| ar.S.L24 | -0.6821 | 0.512 | -1.332 | 0.183 | -1.686 | 0.322 |
| sigma2 | 0.0449 | 0.047 | 0.962 | 0.336 | -0.047 | 0.136 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 2.12 |
|---|---|---|---|
| Prob(Q): | 0.97 | Prob(JB): | 0.35 |
| Heteroskedasticity (H): | 1.99 | Skew: | 0.62 |
| Prob(H) (two-sided): | 0.29 | Kurtosis: | 3.41 |
auto-fit order: (0, 0, 0) auto-fit seasonal_order: (2, 1, 0, 12)
Mean squared error: 0.08415, Mean abs percentage error: 0.06593
ARIMA(order=(0, 0, 0), scoring_args={}, seasonal_order=(2, 1, 0, 12), suppress_warnings=True)
To compare models, it is best to use a cross-validation split, rather than just the performance on a test set. In time series, cross-validation works differently, as we need to take into account the sequence of events. I will use the TimeSeriesSplit function of scikit-learn, comparing the performance on 3 splits with a test size of 3, as not many data points are available.
Train: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38] Test: [39 40 41]
Train: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41] Test: [42 43 44]
Train: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44] Test: [45 46 47]
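The folds above can be reproduced with a short sketch: 3 expanding-window splits with a test size of 3 on 48 monthly points.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(48)  # stand-in for the 48 monthly values
tscv = TimeSeriesSplit(n_splits=3, test_size=3)
for train_idx, test_idx in tscv.split(y):
    print(f"Train: {len(train_idx)} points, Test: {test_idx}")
```

Each fold trains on all data up to the test window, so the training set grows while the test windows never overlap.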
The three models (Triple Exponential Smoothing, SARIMA (1,0,0)x(1,1,0,12), ARIMA (0,0,0)x(2,1,0,12)) offer similar performance on the cross-validated test set. The auto-fitted ARIMA model performs worst (possibly because the automatic fit had few data points available), followed by exponential smoothing, with the SARIMA model performing best. It is remarkable that a relatively simple Triple Exponential Smoothing model is able to compete with the SARIMA model in this case study.
| date | real | sarima | auto_arima | triple_smoothing | |
|---|---|---|---|---|---|
| 0 | 2010-03-31 | 4.51406 | 4.389967 | 4.494366 | 4.405448 |
| 1 | 2010-04-30 | 4.01282 | 4.116398 | 4.208333 | 4.192382 |
| 2 | 2010-05-31 | 4.11602 | 4.042717 | 4.075042 | 4.139283 |
| 3 | 2010-06-30 | 4.03282 | 3.743101 | 3.732752 | 3.830997 |
| 4 | 2010-07-31 | 3.41000 | 3.623548 | 3.828459 | 3.770710 |
| 5 | 2010-08-31 | 3.97004 | 3.870919 | 3.934475 | 3.949263 |
| 6 | 2010-09-30 | 3.87928 | 3.885768 | 4.104954 | 4.060887 |
| 7 | 2010-10-31 | 4.70402 | 4.342627 | 4.483800 | 4.492324 |
| 8 | 2010-11-30 | 4.43200 | 4.713179 | 4.727936 | 4.816216 |
MAPE:
 Triple Smoothing: 0.046
 Auto ARIMA: 0.049
 SARIMA: 0.041
MSE:
 Triple Smoothing: 0.049
 Auto ARIMA: 0.055
 SARIMA: 0.042
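The comparison metrics can be sketched as plain functions (the same definitions scikit-learn's metrics module provides), applied here to the first three rows of the real-vs-SARIMA columns from the table above:

```python
import numpy as np

def mape(real, pred):
    """Mean absolute percentage error."""
    return np.mean(np.abs((real - pred) / real))

def mse(real, pred):
    """Mean squared error."""
    return np.mean((real - pred) ** 2)

# First three rows of the comparison table (real vs. sarima)
real = np.array([4.51406, 4.01282, 4.11602])
pred = np.array([4.389967, 4.116398, 4.042717])
print(f"MAPE: {mape(real, pred):.3f}, MSE: {mse(real, pred):.3f}")
```
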
The three models give similar predictions, with the same trend of decreasing demand moving towards summer. The SARIMAX model is the least conservative, predicting a lower amount of power needed with respect to the other models.
ARIMA(order=(0, 0, 0), seasonal_order=(2, 1, 0, 12))
Several time series modeling techniques were applied to model the monthly 99th percentile of Global_active_power (99MPC). A SARIMA model was chosen as best approximating and predicting the future behavior of the series, capturing both the long-term trend and the seasonality. The 99MPC is decreasing over time, and 5.5 kW is the selected solar panel system size (considering the notable assumptions that the sun is always shining and that all solar panel power is available to the appliances). Given the decreasing trend of the series, a six-month forecast is not useful in determining the size of the solar power system.